2022-10-03

Präsentation

Goal:

Train a Machine learning Algorithm to predict Enzyme Comission Number (EC -number ) from Protein Sequence.

EC - Numbers

Class Reaction catalyzed Typical reaction Enzyme example
EC 1 Oxidoreductases Oxidation/reduction reactions; transfer of H and O atoms or electrons from one substance to another AH + B → A + BH (reduced) A + O → AO (oxidized) Dehydrogenase, oxidase
EC 2 Transferases Transfer of a functional groupf from one substance to another. The group may be methyl-, acyl-, amino- or phosphate group AB + C → A + BC Transaminase, kinase
EC 3 Hydrolases Formation of two products from a substrate by hydrolysis AB + H2O → AOH + BH Lipase, amylase, peptidase, phosphatase
EC 4 Lyases Non-hydrolytic addition or removal of groups from substrates. C-C, C-N, C-O or C-S bonds may be cleaved RCOCOOH → RCOH + CO2 or [X-A+B-Y] → [A=B + X-Y] Decarboxylase
EC 5 Isomerases Intramolecule rearrangement, i.e. isomerization changes within a single molecule ABC → BCA Isomerase, mutase
EC 6 Ligases Join together two molecules by synthesis of new C-O, C-S, C-N or C-C bonds with simultaneous breakdown of ATP X + Y + ATP → XY + ADP + Pi Synthetase
EC 7 Translocases Catalyse the movement of ions or molecules across membranes or their separation within membranes Transporter

(from https://www.abbexa.com/enzyme-commission-number altered)

Challenges

  • SAR -paradoxon: SAR: _Structure–Activity Relationship … refers to the fact that it is not the case that all similar molecules have similar activities.

  • Sequence > Structure > Function

  • Sequence > (ML) > Function

Input Data:

Creation of the training dataset:

To work on the implementation of the algorithm we queried the Uniprot Database for reviewed human enzyme sequences with know EC numbers, using their REST-Api.

Creation of larger dataset:

To create a larger training and validation dataset we used the training and validation data that had been used in the creation of ECPred a Java application for ECN prediction. The benefits in using this dataset was that it is already curated for machine learning, is very vast. The drawback was that the repository only contains UniProt accession numbers and not the sequences of proteins. For this reason we had to download the sequences and build the dataset ourselfs afterwards.

To keep the scope of the project reasonlable we focused only on the prediction of the main enzyme identifiier number EC{1,2,3,4,5,6}

Variables

  • Amino acid composition ✅
  • molar refractivity 🟩
  • polar surface area 🟩
  • partition coefficient 🟩
  • fingerprints 🟩

Input Data:

# create directories
for(i in c("uniprot","expasy")){
dir.create(path = paste0("data/raw/",i),recursive = TRUE)
}
# Download fasta files using wget:
for( i in 1:7){
  url<-paste0("https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=%28%28taxonomy_id%3A9606%29%20AND%20%28ec%3A",i,"%29%29%20AND%20%28reviewed%3Atrue%29")
  dest<-paste0("data/raw/uniprot/EC_",i,"_HomoSapiens")
  download.file(url,method = 'wget',destfile = dest)
  }
list.files("data/raw/uniprot/")

Read Fasta

# read fasta Files using seqinr- package:
EC_1<-seqinr::read.fasta("data/raw/uniprot/EC_1_HomoSapiens",
                        seqtype = "AA",
                        seqonly = FALSE)
EC_2<-seqinr::read.fasta("data/raw/uniprot/EC_1_HomoSapiens",
                        seqtype = "AA",
                        seqonly = TRUE)
# -> this gives us a list of S3 types "SeqFastaAA"
summary(EC_1)|>head(5)|>knitr::kable()

Aminoacid composition

SeqFastaAA_list_to_matrix <- function(SeqFastaAA_list) {
  statslist<-empty.dump()
## get AA-Composition list from fasta:
  for(i in 1:length(SeqFastaAA_list)){
    statslist[[i]]<-AAstat(SeqFastaAA_list[[i]],plot = FALSE)$Compo
  }

## Set names: 
  names(statslist)<-names(SeqFastaAA_list)
## transform to matrix:
  AA_matrix<-do.call(rbind,statslist)
## use three letter annotation:
  colnames(AA_matrix)[-1]<-aaa(colnames(AA_matrix)[-1])
  return(AA_matrix)
}

check Values

SeqFastaAA_list_to_matrix(EC_1)|>knitr::kable()
SeqFastaAA_list_to_matrix(EC_1)|>head(5)|>rowSums()|>knitr::kable()

CODE for all EC numbers:

EC_list<-empty.dump()
for(i in 1:7){
 EC_list[[i]]<-seqinr::read.fasta(
   paste0("data/raw/uniprot/EC_",i,"_HomoSapiens"),
                        seqtype = "AA",
                        seqonly = FALSE)
}
EC_matrix_list<-lapply(EC_list,SeqFastaAA_list_to_matrix)
# Add EC number:
names(EC_matrix_list)<-c(paste("EC",1:7))
## I call cbind.data.frame to keep the cells as int (cbind converts to character)
for(i in 1:7){
EC_matrix_list[[i]]<- cbind.data.frame( EC_matrix_list[[i]],"EC"=names(EC_matrix_list)[i])
}
# Bind rows to one big df:
comp_matrix<-do.call(rbind,EC_matrix_list)
saveRDS(comp_matrix,"data/comp_matrix_HSapiens.rds")

A look at the data

comp_matrix<-readRDS("data/comp_matrix_HSapiens.rds")
comp_matrix$EC<-as.factor(comp_matrix$EC)
head(comp_matrix)
##                                * Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met
## EC 1.sp|A0A087X1C5|CP2D7_HUMAN 0  42   9  23  28  31  36  16  16  15  66  11
## EC 1.sp|A1KZ92|PXDNL_HUMAN     0 111  46  76  80  58  91  48  63  52 145  23
## EC 1.sp|A2RUC4|TYW5_HUMAN      0  20   3  21  23  20  15   8  13  26  32   5
## EC 1.sp|A5PLL7|PDES1_HUMAN     0  20   8  12  17  13  18  16  18  10  35   3
## EC 1.sp|B2RXH2|KDM4E_HUMAN     0  33  12  17  31  20  45  20  18  20  41  15
## EC 1.sp|C9JRZ8|AK1BF_HUMAN     0  21   3  19  27  21  14  10  19  31  27   7
##                                Asn Pro Gln Arg Ser Thr Val Trp Tyr   EC
## EC 1.sp|A0A087X1C5|CP2D7_HUMAN  11  40  23  41  25  25  44   7   6 EC 1
## EC 1.sp|A1KZ92|PXDNL_HUMAN      62  96  79 103  95  92  87  21  35 EC 1
## EC 1.sp|A2RUC4|TYW5_HUMAN        8  17  15  18  19  10  23   4  15 EC 1
## EC 1.sp|A5PLL7|PDES1_HUMAN       7  14   9  14   9  15  13  13   6 EC 1
## EC 1.sp|B2RXH2|KDM4E_HUMAN      18  32  30  30  35  35  24  10  20 EC 1
## EC 1.sp|C9JRZ8|AK1BF_HUMAN      10  17  11  11  14  16  22   6  10 EC 1

A look at the data -2

### Plot number of enzymes per group
comp_matrix %>% group_by(EC) %>% tally() %>% ggplot(aes(EC,n))+geom_col()

A look at the data -3

length of sequence per group

#### Plot length of sequence per group
AA_len<- cbind("EC"=comp_matrix$EC,"nAA"=rowSums(comp_matrix[1:21]))
boxplot(nAA ~ EC,data = AA_len) 

## -> big outlier in EC 2

A look at the data -4

PCR - just for fun…

EC 2.sp

Q8WZ42|TITIN_HUMAN (ein strukturprotein?)

Key component in the assembly and functioning of vertebrate striated muscles. By providing connections at the level of individual microfilaments, it contributes to the fine balance of forces between the two halves of the sarcomere. The size and extensibility of the cross-links are the main determinants of sarcomere extensibility properties of muscle. In non-muscle cells, seems to play a role in chromosome condensation and chromosome segregation during mitosis. Might link the lamina network to chromatin or nuclear actin, or both during interphase.

comp_matrix %>% filter(row.names(comp_matrix) %in% c("EC 2.sp|Q8WZ42|TITIN_HUMAN"))
##                            *  Ala Cys  Asp  Glu Phe  Gly His  Ile  Lys  Leu Met
## EC 2.sp|Q8WZ42|TITIN_HUMAN 0 2084 513 1720 3193 908 2066 478 2062 2943 2117 398
##                             Asn  Pro Gln  Arg  Ser  Thr  Val Trp Tyr   EC
## EC 2.sp|Q8WZ42|TITIN_HUMAN 1111 2517 942 1640 2463 2546 3184 466 999 EC 2
# just a giant enzyme....... -----

Modeling Methods

Naive approach

only Amino acid composition as features baseline

AA composition and protein statistics

  • number of residues
  • pysio-chemical classes
  • isoelectricity
library(protr)

ids <- c("P00750", "P00751", "P00752")
prots <- getUniProt(ids)
prots <- removeGaps(prots)
prots <- subset(prots,unlist(lapply(prots, protcheck)))
prots# remove problematic sequences
prots <- lapply(prots,extractAAC ) #calcualte AAC
rm -r ECPred
git clone --depth 1 --filter=blob:none --sparse https://github.com/cansyl/ECPred
cd ECPred
git sparse-checkout set ECPred\ Datasets
cd ./ECPred/ECPred\ Datasets
#rm -rf !(EC_Main*)
tree
## .
## ├── EC_MainClass_NegativeTrain
## │   ├── Hydrolases_Negative_Train.txt
## │   ├── Isomerases_Negative_Train.txt
## │   ├── Ligases_Negative_Train.txt
## │   ├── Lyases_Negative_Train.txt
## │   ├── Oxidoreductases_Negative_Train.txt
## │   └── Transferases_Negative_Train.txt
## ├── EC_MainClass_NegativeValidation
## │   ├── Hydrolases_Negative_Validation.txt
## │   ├── Isomerases_Negative_Validation.txt
## │   ├── Ligases_Negative_Validation.txt
## │   ├── Lyases_Negative_Validation.txt
## │   ├── Oxidoreductases_Negative_Validation.txt
## │   └── Transferases_Negative_Validation.txt
## ├── EC_MainClass_PositiveTrain
## │   ├── Hydrolases_Positive_Train.Txt
## │   ├── Isomerases_Positive_Train.txt
## │   ├── Ligases_Positive_Train.txt
## │   ├── Lyases_Positive_Train.txt
## │   ├── Oxidoreductases_Positive_Train.txt
## │   └── Transferases_Positive_Train.txt
## └── EC_MainClass_PositiveValidation
##     ├── Hydrolases_Positive_Validation.txt
##     ├── Isomerases_Positive_Validation.txt
##     ├── Ligases_Positive_Validation.txt
##     ├── Lyases_Positive_Validation.txt
##     ├── Oxidoreductases_Positive_Validation.txt
##     └── Transferases_Positive_Validation.txt
## 
## 4 directories, 24 files

  • LDN
  • random forest
  • knn
  • svm
  • CTD Descriptors - Composition
  • physico-chemical classes
  • polar surface area 🟩